ABSTRACT

In this work we apply model averaging to the parallel training
of deep neural networks (DNNs). Data is partitioned and
distributed to different nodes for local model updates, and
model averaging across nodes is done every few minibatches.

We use multiple GPUs for data parallelization, and Message
Passing Interface (MPI) for communication between nodes,
which allows us to perform model averaging frequently
without losing much time on communication. We investigate
the effectiveness of Natural Gradient Stochastic Gradient
Descent (NG-SGD) and Restricted Boltzmann Machine (RBM)
pretraining for parallel training in the model averaging
framework, and explore the best setups in terms of learning
rate schedules, averaging frequencies and minibatch sizes.
It is shown that NG-SGD and RBM pretraining benefit
parameter-averaging based model training. On the 300h
Switchboard dataset, a 9.3 times speedup is achieved using
16 GPUs and a 17 times speedup using 32 GPUs, with limited
decoding accuracy loss.

Index Terms — Parallel training, model averaging, deep 
neural network, natural gradient 

1. INTRODUCTION 

Deep Neural Networks (DNNs) have shown their effectiveness in
several machine learning tasks, especially in speech
recognition. The large model size and the massive number of
training examples make the DNN a powerful model for
classification. However, these two factors also slow down the
training procedure.

Parallelization of DNN training has been a popular topic
since the revival of neural networks. Several different
strategies have been proposed to tackle this problem.
Multi-threaded CPU parallelization and a single-GPU
implementation are compared in [1, 2], and it is shown that a
single GPU can beat a multi-threaded CPU implementation by a
factor of 2.

* This work is not submitted to peer-review conferences because the
authors think it needs more investigation. The authors lack the
resources to perform further exploration. However, we welcome any
comments and suggestions.

Optimality for parallelization of DNN training was analyzed
in [?], and based on the analysis, a gradient quantization
approach (1-bit SGD) was proposed to minimize communication
cost [?]. It shows that 1-bit quantization can effectively
reduce data exchange in an MPI framework, and a 10 times
speedup is achieved using 40 GPUs.

DistBelief, proposed in [?], reports that 8 CPU machines
train 2.2 times faster than a single GPU machine on a
moderately sized speech model. Asynchronous SGD using multiple
GPUs achieved a 3.2x speedup on 4 GPUs [?].

A pipelined training approach was proposed in [?] and a
3.3x speedup was achieved using 4 GPUs, but this method
does not scale beyond the number of layers in the neural
network.

A speedup of 6x to 14x was achieved using 16 GPUs on
training convolutional neural networks [?]. In this approach,
each GPU is responsible for a partition of the neural network.
This approach is more useful for image classification, where
the local structure of the neural network can be exploited.
For a fully connected speech model, a model partition approach
may not be able to contribute as much.

Distributed model averaging is used in [?], and a further
improvement is made using NG-SGD [11]. In this approach,
separate models are trained on multiple nodes using
different partitions of the data, and model parameters are
averaged after each epoch. It is shown that NG-SGD can
effectively improve convergence and ensure that a better model
is trained using the model averaging framework.

Our approach is mainly based on NG-SGD with model
averaging. We utilize multiple GPUs in neural network
training via MPI, which allows us to perform model averaging
more frequently and efficiently. Unlike the approach in [?],
we do not use a warm-up phase where only a single thread is
used for model updates (admittedly, such a phase might lead
to further improvement). In this work, we conduct a series of
experiments and compare different setups in the model
averaging framework.

In Section 2, we relate our work to prior work on NG-SGD.
Section 3 describes the model averaging approach and some
intuition behind it. Section 4 introduces the natural gradient
for model update. Section 5 records experimental results on
different setups and Section 6 concludes.

2. RELATIONSHIP TO PRIOR WORKS 

To avoid confusion, we should mention that Kaldi [?]
contains two neural network recipes. The first
implementation^1 is described in [13], which supports
Restricted Boltzmann Machine pretraining [?] and
sequence-discriminative training [?]. It uses a single GPU for
SGD training. The second implementation^2 [?] was originally
designed to support parallel training on multiple CPUs. Now
it also supports multiple GPUs for training using model
averaging. By default, it uses layer-wise discriminative
pretraining.

Our work extends the first implementation so that it can
utilize multiple GPUs using model averaging. We use MPI in
our implementation, so file I/O is avoided during model
averaging. This allows us to perform model averaging much more
frequently.


^1 Location in code: src/{nnet,nnetbin}
^2 Location in code: src/{nnet2,nnet2bin}

3. DATA PARALLELIZATION AND MODEL AVERAGING

SGD is a popular method for DNN training. Even though
neural network training objectives are usually non-convex,
mini-batch SGD has been shown to be effective for optimizing
the objective [?]. Roughly speaking, a bigger minibatch
size gives a better estimate of the gradient, resulting in a
better convergence rate. Thus, a straightforward idea for
parallelization would be distributing the gradient computation
to different computing nodes. In each step, gradients of
minibatches on different nodes are reduced to a single node,
averaged and then used to update the models on each node. This
method, i.e. gradient averaging, can compute the gradient
accurately, but it requires heavy communication between nodes.
Also, it is shown that increasing the minibatch size does not
always benefit model training [?], especially in the early
stage of model training.

On the other hand, if we choose to average the parameters
rather than the gradients, it is not necessary to exchange
data that often. Currently, there is no straightforward theory
that guarantees convergence, but we would like to explore why
this strategy should work, as we observe in the experiments.

First, in the extreme case where model parameters are
averaged after each weight update, model averaging is
equivalent to gradient averaging. Furthermore, if model
averaging is done every n minibatch-based weight updates, the
model update formula can be written as

[equation missing in the source]

where $\theta$ is the model parameter and $\alpha$ is the
learning rate. If the change in the model parameter $\theta$
is limited within n updates, this approach can be seen as an
approximation to gradient averaging.

Second, it is shown that model averaging for convex models
is guaranteed to converge in [?]. It is suggested that
unsupervised pretraining guides the learning towards basins of
attraction of minima that support better generalization from
the training data set [?].

Fig. 1 is an example of all-reduce with 4 nodes. This
operation can be easily implemented with MPI_Allreduce.

Fig. 1. All-reduce network
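The relationship between model averaging and gradient averaging described in this section can be checked on a toy problem. The sketch below is a pure-Python simulation, not the MPI implementation used in this work: `allreduce_mean` imitates a butterfly all-reduce over four nodes, and the objective, the data and all function names are hypothetical choices made only for illustration. The update uses the same sign convention as the SGD formula in Section 4 (parameter plus learning rate times gradient).

```python
import math


def allreduce_mean(vectors):
    """Butterfly (recursive-doubling) all-reduce for a power-of-two node count.

    Every node ends up holding the element-wise mean of all input vectors.
    """
    n = len(vectors)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    bufs = [list(v) for v in vectors]
    step = 1
    while step < n:
        # Each rank adds the buffer of the partner whose rank differs in one bit.
        bufs = [
            [a + b for a, b in zip(bufs[rank], bufs[rank ^ step])]
            for rank in range(n)
        ]
        step *= 2
    return [[x / n for x in buf] for buf in bufs]


def grad(theta, batch):
    """Toy gradient, nonlinear in theta: mean(x) - tanh(theta) per dimension."""
    return [
        sum(x[d] for x in batch) / len(batch) - math.tanh(theta[d])
        for d in range(len(theta))
    ]


def model_averaging(theta0, shards, lr, interval):
    """Each node makes `interval` local updates, then all models are averaged."""
    models = [list(theta0) for _ in shards]
    for start in range(0, len(shards[0]), interval):
        for node, shard in enumerate(shards):
            for batch in shard[start:start + interval]:
                g = grad(models[node], batch)
                models[node] = [t + lr * gi for t, gi in zip(models[node], g)]
        models = allreduce_mean(models)
    return models[0]


def gradient_averaging(theta0, shards, lr):
    """All nodes share one model; gradients are averaged at every step."""
    theta = list(theta0)
    for step in range(len(shards[0])):
        grads = allreduce_mean([grad(theta, shard[step]) for shard in shards])
        theta = [t + lr * gi for t, gi in zip(theta, grads[0])]
    return theta


# Four nodes, four minibatches each; node i only ever sees the value 0.2 * (i + 1).
shards = [[[[0.2 * (node + 1)]] for _ in range(4)] for node in range(4)]
ga = gradient_averaging([0.0], shards, 0.5)
ma_every_step = model_averaging([0.0], shards, 0.5, 1)
ma_every_two = model_averaging([0.0], shards, 0.5, 2)
```

In this toy setting, averaging after every update reproduces gradient averaging up to rounding, while averaging every two updates gives a nearby but different parameter, which is the sense in which infrequent averaging only approximates gradient averaging.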


4. NATURAL GRADIENT FOR MODEL UPDATE 

This section introduces the idea proposed in [11].

In stochastic gradient descent (SGD), the learning rate is
often assumed to be a scalar $\alpha_t$ that may change over
time. The update formula for the model parameters $\theta_t$ is

$$\theta_{t+1} = \theta_t + \alpha_t g_t \qquad (3)$$

where $g_t$ is the gradient.

However, according to the natural gradient idea [20, ?], it
is possible to replace the scalar with a symmetric positive
definite matrix $E_t$, which is the inverse of the Fisher
information matrix:

$$\theta_{t+1} = \theta_t + \alpha_t E_t g_t \qquad (4)$$

Suppose $x$ is the variable we are modeling, and
$f(x; \theta)$ is the probability or likelihood of $x$ given
parameters $\theta$. Then the Fisher information matrix
$I(\theta)$ is defined as

$$I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta} \log f(x; \theta) \right) \left( \frac{\partial}{\partial \theta} \log f(x; \theta) \right)^{T} \right] \qquad (5)$$

For large-scale speech recognition, it is impossible to
estimate the Fisher information matrix and perform the
inversion, so it is necessary to approximate the inverse Fisher
information matrix directly. Details about the theory and
implementation of NG-SGD can be found in [11].
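For a model small enough that the Fisher information is tractable, the difference between the plain update and the natural gradient update can be made concrete. The following is a minimal sketch for a one-parameter Bernoulli model, chosen here purely for illustration; it is not the approximate online method of NG-SGD, and all function names are hypothetical. With a single parameter, the matrix in the natural gradient update reduces to the scalar 1 / I(p).

```python
def bernoulli_score(x, p):
    """d/dp log f(x; p) for f(x; p) = p^x * (1 - p)^(1 - x), with x in {0, 1}."""
    return x / p - (1 - x) / (1.0 - p)


def fisher_information(p):
    """Expected squared score under x ~ Bernoulli(p), i.e. the scalar Fisher information."""
    return p * bernoulli_score(1, p) ** 2 + (1.0 - p) * bernoulli_score(0, p) ** 2


def sgd_step(p, lr, g):
    """Plain update: scalar learning rate times gradient."""
    return p + lr * g


def natural_gradient_step(p, lr, g):
    """Natural gradient update: the gradient is scaled by the inverse Fisher information."""
    return p + lr * g / fisher_information(p)


data = [1, 1, 1, 0]          # toy observations, sample mean 0.75
p0 = 0.2                     # starting parameter
g0 = sum(bernoulli_score(x, p0) for x in data) / len(data)  # mean log-likelihood gradient
p_sgd = sgd_step(p0, 0.1, g0)
p_ng = natural_gradient_step(p0, 1.0, g0)
```

For this model the Fisher information equals 1 / (p (1 - p)), so a natural gradient step with unit learning rate moves the parameter directly to the sample mean, whereas the size of the plain step depends strongly on where the parameter currently sits.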


5. EXPERIMENTAL RESULTS

5.1. Setup

In this work, we report speech recognition results on the
300 hour Switchboard conversational telephone speech task
(Switchboard-1 Release 2). We use the MSU-ISIP release
of the Switchboard segmentations and transcriptions (date
11/26/02), together with the Mississippi State transcripts
and the 30K-word lexicon released with those transcripts.
The lexicon contains pronunciations for all words and word
fragments in the training data. We use the Hub5 '00 data for
evaluation. Specifically, we use the development set and the
Hub5 '01 (LDC2002S13) data as a separate test set.

The Kaldi toolkit [?] is used as the speech recognition
framework. Standard 13-dim PLP features, together with
3-dim Kaldi pitch features, are extracted and used for maximum
likelihood GMM model training. Features are then transformed
using LDA+MLLT before SAT training. After GMM training is
done, a tanh-neuron DNN-HMM hybrid system is trained using
the 40-dimensional transformed fMLLR (also known as
CMLLR [22]) features as input and GMM-aligned senones as
targets. fMLLR is estimated in an EM fashion for both training
data and test data. A trigram language model (LM) is trained
on 3M words of the training transcripts only.

The work in this paper is built on top of the Kaldi nnet1
setup and the NG-SGD method introduced in the nnet2 setup.
Details of DNN training follow Section 2.2 in [13]. In this
work, we use 6 hidden layers, where each hidden layer has
2048 neurons with sigmoids. The input layer is
440-dimensional (i.e. the context of 11 fMLLR frames), and the
output layer is 8806-dimensional. Mini-batch SGD is used for
backpropagation and the minibatch size is set to 1024 for all
the experiments. By default, DNNs are initialized with stacked
restricted Boltzmann machines (RBMs) that are pretrained in a
greedy layerwise fashion [?]. A comparison between random
initialization and RBM initialization in the model averaging
framework is reported in Section 5.3.

The server hardware used in this work is Stampede
(TACC) (URL: https://portal.xsede.org/tacc-stampede). It
is a Dell Linux cluster provided as an Extreme Science and
Engineering Discovery Environment (XSEDE) digital service by
the Texas Advanced Computing Center (TACC). Stampede
is configured with 6,400 Dell DCS Zeus compute nodes, the
majority of which are configured with two 2.7 GHz E5-2680
Intel Xeon (Sandy Bridge) processors and one Intel Xeon
Phi SE10P coprocessor. 128 of the nodes are augmented
with an NVIDIA K20 GPU and 8 GB of GDDR5 memory
each, which we use for neural network training in this work.
Stampede nodes run Linux 2.6.32 with batch services managed
by the Simple Linux Utility for Resource Management
(SLURM).

5.2. Switchboard Results

Fig. 2 shows the scaling factor and speedup plot for the
model averaging experiments. As shown in the graph, a speedup
of 17 can be achieved when 32 GPUs are used. Table 1
shows the main decoding results for DNNs trained using
different numbers of GPUs. In general, the decoding results of
DNNs trained with model averaging degrade by 0.3-0.4 WER,
depending on the number of GPUs used.

Fig. 2. Scaling factor and speedup factor vs. number of GPUs
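The reported speedups can be turned into a per-GPU efficiency with one line of arithmetic. The sketch below assumes that efficiency is simply speedup divided by the number of GPUs; whether this matches the "scaling factor" plotted in Fig. 2 is an assumption, and the function name is hypothetical. The speedup values are the ones quoted in the text and tables (9.32 for 16 GPUs, 17 for 32 GPUs).

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup / num_gpus


# Speedups reported in this paper, keyed by number of GPUs.
reported_speedup = {16: 9.32, 32: 17.0}
efficiency = {
    gpus: parallel_efficiency(speedup, gpus)
    for gpus, speedup in reported_speedup.items()
}
```

Under this definition the efficiency is about 0.58 with 16 GPUs and about 0.53 with 32 GPUs, so doubling the number of GPUs costs only a few points of efficiency.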

5.3. Initialization Matters 

Table 2 compares random initialization with Restricted
Boltzmann Machine (RBM) based initialization.

As we can see in the table, random initialization is worse
than RBM pretraining by 0.9/0.6 in the single-GPU case, while
in the model averaging setup random initialization becomes
even worse: 0.3/0.9 points more degradation in WER.

Data \ Nodes    1      2    4    8      16     32
SWB             14.7   -    -    15.1   15.1   15.2
CallHM          26.8   -    -    27.4   27.0   27.1
SWB             16.1   -    -    16.4   16.2   16.4
SWB2P3          21.0   -    -    21.8   21.7   21.7
SWB-Cell        27.4   -    -    27.3   27.4   27.8

Table 1. Comparison of WERs using different numbers of
GPUs

                   SWB            CallHome
Nodes              1      32      1      32
random init        15.6   16.4    27.4   28.8
RBM pretraining    14.7   15.2    26.8   27.1

Table 2. Comparing RBM pretraining with random
initialization

5.4. Averaging frequency 

Averaging frequency here is defined as the number of minibatch
SGD updates performed per model averaging. Due to limited
computing resources, we only did preliminary experiments on
this. A minibatch size of 1024 is set as default, and we compare
averaging frequencies of 10 and 20. Table 3 shows that
an averaging frequency of 10 gives a slightly worse speedup but
a better decoding WER. This tradeoff between less frequent
averaging (i.e. better speedup) and better training accuracy
is within expectation, in that frequent model averaging means
steadier gradient estimation.
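To get a feel for the communication side of this tradeoff, one can count how many averaging rounds an epoch needs and how many parameters each round exchanges. The sketch below is a back-of-the-envelope calculation under stated assumptions: the layer sizes are those given in Section 5.1, bias terms are assumed for every layer, and the frame count assumes 300 hours of audio at 100 frames per second, which is not stated in the text. The function names are hypothetical.

```python
def dnn_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network (biases are an assumption)."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def averaging_rounds_per_epoch(num_frames, minibatch_size, num_nodes, interval):
    """Number of model-averaging rounds needed to sweep the data once."""
    frames_per_round = minibatch_size * num_nodes * interval
    return -(-num_frames // frames_per_round)  # ceiling division


# Section 5.1: 440 inputs, six hidden layers of 2048 units, 8806 outputs.
LAYER_SIZES = [440] + [2048] * 6 + [8806]
# Assumption: 300 hours of audio at 100 frames per second (10 ms frame shift).
ASSUMED_FRAMES = 300 * 3600 * 100

params = dnn_parameter_count(LAYER_SIZES)
rounds_10 = averaging_rounds_per_epoch(ASSUMED_FRAMES, 1024, 16, 10)
rounds_20 = averaging_rounds_per_epoch(ASSUMED_FRAMES, 1024, 16, 20)
```

Under these assumptions each averaging round exchanges roughly 40 million parameters, and moving from an averaging frequency of 10 to 20 halves the number of rounds per epoch on 16 nodes (660 versus 330).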

5.5. Minibatch Size 

Table 4 compares two different minibatch sizes in the model
averaging setup.

                    SWB           CallHome
Nodes               1      16     1      16     Speedup
minibatch 256       15.3   15.6   26.8   27.3   -
minibatch 1024      14.7   15.1   26.8   27.0   9.32

Table 4. Comparing different minibatch sizes

5.6. Learning Rate Schedule

The initial learning rate is increased in proportion to the
number of threads in the model averaging setup. The reason for
this is straightforward: assume we have n minibatches of data
for model training. When the model is trained using a single
thread, it gets updated n times. When the data is distributed
to m machines, each model gets updated n/m times. Since
the effect of model averaging is mostly aggregating knowledge
learnt from different data partitions, the absolute change
of the model should be compensated by a factor of m.

                SWB            CallHome
Nodes           1      16      1      16
Newbob          14.9   15.4    26.6   27.2
exponential     14.7   15.1    26.8   27.0

Table 5. Comparing learning rate schedules

We compare two learning rate schedules in this section.
The first one is the default setup used in Kaldi nnet1
(Newbob). It starts with an initial learning rate of 0.32 and
halves the rate when the improvement in frame accuracy on a
cross-validation set between two successive epochs falls below
0.5%. The optimization terminates when the frame accuracy
increases by less than 0.1%. Cross-validation is done on 10%
of the utterances, which are held out from the training data.

The second learning rate schedule is exponentially decaying.
This method is used in [23, 11] and is shown to be superior
to performance scheduling and power scheduling. In this work,
it starts with the same initial learning rate as the first
method (Newbob) and decreases to the final learning rate
(which is set to 0.01 * initial learning rate). The number of
epochs is set to 15 in this task, the same as for Newbob
scheduling.

As shown in Table 5, these two learning rate scheduling
methods give similar decoding results. However, the exponential
learning rate schedule might need more tuning, since it
requires an initial learning rate, a final learning rate and a
predefined number of epochs to train.
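The two schedules and the learning rate scaling described in this section can be sketched in a few lines. This is an illustrative reading of the prose, not the Kaldi implementation: the Newbob rule below applies the two thresholds exactly as stated in the text, the exponential schedule is assumed to interpolate geometrically between the initial and final rates, and all function and parameter names are hypothetical.

```python
def scaled_initial_lr(base_lr, num_workers):
    """Initial learning rate increased in proportion to the number of workers."""
    return base_lr * num_workers


def newbob_step(lr, improvement, halve_below=0.005, stop_below=0.001):
    """One epoch of the Newbob rule as described in the text.

    `improvement` is the gain in cross-validation frame accuracy over the
    previous epoch, as a fraction (0.005 means 0.5%).
    Returns the learning rate for the next epoch and a stop flag.
    """
    stop = improvement < stop_below
    if improvement < halve_below:
        lr = lr / 2.0
    return lr, stop


def exponential_lr(initial_lr, final_lr, num_epochs, epoch):
    """Geometric decay from initial_lr at epoch 0 to final_lr at the last epoch."""
    if num_epochs < 2:
        return initial_lr
    return initial_lr * (final_lr / initial_lr) ** (epoch / (num_epochs - 1))


# Exponential schedule with the values from the text:
# initial rate 0.32, final rate 0.01 * initial, 15 epochs.
schedule = [exponential_lr(0.32, 0.0032, 15, epoch) for epoch in range(15)]
```

The exponential schedule needs all three of its arguments fixed in advance, whereas `newbob_step` only reacts to the measured improvement, which is the tuning difference noted above.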

5.7. Online NG-SGD Matters 

Table 6 compares plain SGD with NG-SGD in model averaging
mode, and it shows that NG-SGD is crucial to model training
with parameter averaging.


Frequency    Speedup    SWB     CallHome
baseline     -          14.7    26.8
10           9.32       15.1    27.0
20           10.07      15.8    28.0

Table 3. Comparing different averaging frequencies

             SWB            CallHome
Nodes        1      16      1      16
SGD          14.9   16.3    26.9   28.3
NG-SGD       14.7   15.1    26.8   27.0

Table 6. Comparing NG-SGD and naive SGD

6. CONCLUSION AND FUTURE WORK 

In this work, we show that neural network training can be
efficiently sped up using model averaging. On the 300h
Switchboard dataset, a 9.3x / 17x speedup can be achieved
using 16 / 32 GPUs respectively, with limited decoding
accuracy loss. We also show that model averaging benefits
considerably from NG-SGD and RBM-based pretraining.
Preliminary experiments on minibatch size, averaging frequency
and learning rate schedules are also presented.

Further accuracy improvement might be achieved if parallel
training runs on top of serial training initialization. It
would be interesting to see if sequence-discriminative training
combines well with model averaging. The speedup factor could
be further improved if CUDA-aware MPI is used. Theory on
convergence using model averaging remains to be explored, and
might be useful for guiding future development.